multimodal interaction
Building the Web for Agents: A Declarative Framework for Agent-Web Interaction
Schultze, Sven, Kietzmann, Meike Verena, Schönfeld, Nils-Lucas, Stock-Homburg, Ruth
The increasing deployment of autonomous AI agents on the web is hampered by a fundamental misalignment: agents must infer affordances from human-oriented user interfaces, leading to brittle, inefficient, and insecure interactions. To address this, we introduce VOIX, a web-native framework that enables websites to expose reliable, auditable, and privacy-preserving capabilities for AI agents through simple, declarative HTML elements.
- Europe > Germany > Hesse > Darmstadt Region > Darmstadt (0.05)
- Asia > Japan > Honshū > Kantō > Kanagawa Prefecture > Yokohama (0.04)
- North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
- (3 more...)
- Information Technology > Communications > Web (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
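
As a rough illustration of the declarative approach the VOIX abstract describes, the Python sketch below shows how an agent-side client might discover capabilities that a page declares in its markup. The element name "tool" and its attributes are invented for illustration; the abstract does not specify the actual markup.

```python
# Hypothetical sketch: an agent-side client discovering capabilities that a page
# declares with custom HTML elements. The element name "tool" and its attributes
# are assumptions for illustration, not VOIX's actual markup.
from html.parser import HTMLParser

class DeclaredToolParser(HTMLParser):
    """Collects custom <tool> elements and their attributes from page markup."""
    def __init__(self):
        super().__init__()
        self.tools = []

    def handle_starttag(self, tag, attrs):
        if tag == "tool":
            self.tools.append(dict(attrs))

page = """
<tool name="add_to_cart" description="Add a product to the shopping cart"
      params="product_id:string, quantity:int"></tool>
"""

parser = DeclaredToolParser()
parser.feed(page)
for tool in parser.tools:
    print(tool["name"], "->", tool.get("description"))
```
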
ERR@HRI 2.0 Challenge: Multimodal Detection of Errors and Failures in Human-Robot Conversations
Cao, Shiye, Stiber, Maia, Mahmood, Amama, Parreira, Maria Teresa, Ju, Wendy, Spitale, Micol, Gunes, Hatice, Huang, Chien-Ming
The integration of large language models (LLMs) into conversational robots has made human-robot conversations more dynamic. Yet, LLM-powered conversational robots remain prone to errors, e.g., misunderstanding user intent, prematurely interrupting users, or failing to respond altogether. Detecting and addressing these failures is critical for preventing conversational breakdowns, avoiding task disruptions, and sustaining user trust. To tackle this problem, the ERR@HRI 2.0 Challenge provides a multimodal dataset of LLM-powered conversational robot failures during human-robot conversations and encourages researchers to benchmark machine learning models designed to detect robot failures. The dataset includes 16 hours of dyadic human-robot interactions, incorporating facial, speech, and head movement features. Each interaction is annotated with the presence or absence of robot errors from the system perspective, and perceived user intention to correct for a mismatch between robot behavior and user expectation. Participants are invited to form teams and develop machine learning models that detect these failures using multimodal data. Submissions will be evaluated using various performance metrics, including detection accuracy and false positive rate. This challenge represents another key step toward improving failure detection in human-robot interaction through social signal analysis.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- (4 more...)
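
ERR@HRI 2.0 submissions are scored with detection accuracy and false positive rate, so a minimal sketch of that scoring step is shown below. The per-segment binary label format is an assumption; the challenge's actual evaluation protocol may differ.

```python
# Minimal sketch: scoring per-segment robot-error predictions with the two metrics
# named in the abstract (detection accuracy and false positive rate).
# The data layout (one binary label per interaction segment) is an assumption.
def detection_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    accuracy = (tp + tn) / len(y_true)
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, false_positive_rate

# Example: 1 = robot error present in the segment, 0 = no error.
truth = [0, 0, 1, 1, 0, 1, 0, 0]
preds = [0, 1, 1, 0, 0, 1, 0, 0]
print(detection_metrics(truth, preds))  # (0.75, 0.2)
```
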
Training-Free Multimodal Large Language Model Orchestration
Xie, Tianyu, Wu, Yuhang, Luo, Yongdong, Ji, Jiayi, Zheng, Xiawu
Different Multimodal Large Language Models (MLLMs) cannot be directly integrated into a unified multimodal input-output system. In previous work, training has been considered an inevitable component due to challenges in modal alignment, Text-to-Speech efficiency, and other integration issues. In this paper, we introduce Multimodal Large Language Model Orchestration, an effective approach for creating interactive multimodal AI systems without additional training. MLLM Orchestration leverages the inherent reasoning capabilities of large language models to coordinate specialized models through explicit workflows, enabling natural multimodal interactions while maintaining modularity, improving interpretability, and significantly enhancing computational efficiency. Our orchestration framework is built upon three key innovations: (1) a central controller LLM that analyzes user inputs and dynamically routes tasks to appropriate specialized models through carefully designed agents; (2) a parallel Text-to-Speech architecture that enables true full-duplex interaction with seamless interruption handling and natural conversational flow; and (3) a cross-modal memory integration system that maintains coherent context across modalities through intelligent information synthesis and retrieval, selectively avoiding unnecessary modality calls in certain scenarios to improve response speed. Extensive evaluations demonstrate that MLLM Orchestration achieves comprehensive multimodal capabilities without additional training, delivers performance improvements of up to 7.8% over traditional jointly-trained approaches on standard benchmarks, reduces latency by 10.3%, and significantly enhances interpretability through explicit orchestration processes.
- Asia > China > Fujian Province > Xiamen (0.05)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- Workflow (1.00)
- Research Report (1.00)
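
A hedged sketch of the controller-routes-to-specialists pattern the orchestration abstract describes, with a plain function standing in for the controller LLM. The registry keys, routing rule, and specialist handlers are illustrative assumptions, not the paper's components.

```python
# Hedged sketch of the controller-routes-to-specialists pattern described above.
# The registry keys, routing rule, and handler functions are illustrative
# assumptions, not the paper's actual components.
from typing import Callable, Dict

def describe_image(task: str) -> str:
    return f"[vision model] caption for: {task}"

def synthesize_speech(task: str) -> str:
    return f"[TTS model] audio for: {task}"

def answer_text(task: str) -> str:
    return f"[LLM] answer to: {task}"

SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "vision": describe_image,
    "speech": synthesize_speech,
    "text": answer_text,
}

def controller_route(user_input: str) -> str:
    """Stand-in for the controller LLM: pick a specialist from the request."""
    if "image" in user_input or "picture" in user_input:
        modality = "vision"
    elif "say" in user_input or "read aloud" in user_input:
        modality = "speech"
    else:
        modality = "text"
    return SPECIALISTS[modality](user_input)

print(controller_route("Describe this image of a harbor at dusk"))
print(controller_route("Read aloud the summary of the report"))
```
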
INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling
Dong, Xin, Dong, Shichao, Wang, Jin, Huang, Jing, Zhou, Li, Sun, Zenghui, Jing, Lihua, Lan, Jingsong, Zhu, Xiaoyong, Zheng, Bo
Hallucinations in large vision-language models (LVLMs) pose significant challenges for real-world applications, as LVLMs may generate responses that appear plausible yet remain inconsistent with the associated visual content. This issue rarely occurs in human cognition. We argue that this discrepancy arises from humans' ability to effectively leverage multimodal interaction information in data samples. Specifically, humans typically first gather multimodal information, analyze the interactions across modalities for understanding, and then express their understanding through language. Motivated by this observation, we conduct extensive experiments on popular LVLMs and obtain insights that surprisingly reveal human-like, though less pronounced, cognitive behavior of LVLMs on multimodal samples. Building on these findings, we further propose INTER: Interaction Guidance Sampling, a novel training-free algorithm that mitigates hallucinations without requiring additional data. Specifically, INTER explicitly guides LVLMs to effectively reapply their understanding of multimodal interaction information when generating responses, thereby reducing potential hallucinations. On six benchmarks including VQA and image captioning tasks, INTER achieves an average improvement of up to 3.4% on five LVLMs compared to the state-of-the-art decoding strategy. The code will be released when the paper is accepted.
- Europe > Switzerland > Zürich > Zürich (0.14)
- South America > Brazil > Paraná > Curitiba (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- (2 more...)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.67)
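
INTER is a decoding-time, training-free intervention. The sketch below shows a generic guidance-sampling step in that spirit: next-token logits are blended with an auxiliary score before sampling. The guidance scores and weight are invented for illustration and are not INTER's actual formulation.

```python
# Generic sketch of a training-free, decoding-time adjustment in the spirit of
# guidance sampling: blend the model's next-token logits with an auxiliary score
# that rewards tokens grounded in the visual input. The guidance scores and
# weight below are illustrative assumptions, not INTER's actual formulation.
import numpy as np

def guided_sample(logits, guidance_scores, weight=0.5, temperature=1.0, rng=None):
    rng = rng or np.random.default_rng(0)
    adjusted = logits + weight * guidance_scores      # reward grounded tokens
    probs = np.exp(adjusted / temperature)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs), probs

logits = np.array([2.0, 1.5, 0.2])            # model's raw preferences
grounding = np.array([0.0, 1.0, -1.0])        # higher = better supported by the image
token, probs = guided_sample(logits, grounding)
print(token, np.round(probs, 3))
```
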
Multi-level Mixture of Experts for Multimodal Entity Linking
Hu, Zhiwei, Gutiérrez-Basulto, Víctor, Xiang, Zhiliang, Li, Ru, Pan, Jeff Z.
Multimodal Entity Linking (MEL) aims to link ambiguous mentions within multimodal contexts to associated entities in a multimodal knowledge base. Existing approaches to MEL introduce multimodal interaction and fusion mechanisms to bridge the modality gap and enable multi-grained semantic matching. However, they do not address two important problems: (i) mention ambiguity, i.e., the lack of semantic content caused by the brevity and omission of key information in the mention's textual context; (ii) dynamic selection of modal content, i.e., to dynamically distinguish the importance of different parts of modal information. To mitigate these issues, we propose a Multi-level Mixture of Experts (MMoE) model for MEL. MMoE has four components: (i) the description-aware mention enhancement module leverages large language models to identify the WikiData descriptions that best match a mention, considering the mention's textual context; (ii) the multimodal feature extraction module adopts multimodal feature encoders to obtain textual and visual embeddings for both mentions and entities; (iii)-(iv) the intra-level mixture of experts and inter-level mixture of experts modules apply a switch mixture of experts mechanism to dynamically and adaptively select features from relevant regions of information. Extensive experiments demonstrate the outstanding performance of MMoE compared to the state-of-the-art. MMoE's code is available at: https://github.com/zhiweihu1103/MEL-MMoE.
- Asia > China > Shanxi Province > Taiyuan (0.04)
- Europe > United Kingdom > Wales > Cardiff (0.04)
- Europe > France > Bourgogne-Franche-Comté > Doubs > Besançon (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.86)
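
The MMoE abstract mentions a switch mixture-of-experts mechanism for selecting features. Below is a minimal numpy sketch of switch-style (top-1) routing in general; the expert count, dimensions, and weights are illustrative assumptions rather than MMoE's configuration.

```python
# Minimal sketch of switch-style (top-1) routing, the general mechanism the
# abstract refers to; expert count, dimensions, and gating weights are
# illustrative assumptions, not MMoE's configuration.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 8, 4
x = rng.normal(size=d)                         # one fused mention/entity feature
gate_w = rng.normal(size=(n_experts, d))       # gating projection
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]

gate_logits = gate_w @ x
probs = np.exp(gate_logits - gate_logits.max())
probs /= probs.sum()
chosen = int(np.argmax(probs))                 # switch routing: one expert per sample
output = probs[chosen] * (experts[chosen] @ x) # scale by the gate, as in switch MoE
print("routed to expert", chosen, "gate prob", round(float(probs[chosen]), 3))
```
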
Efficient Quantification of Multimodal Interaction at Sample Level
Yang, Zequn, Wang, Hongfa, Hu, Di
Interactions between modalities -- redundancy, uniqueness, and synergy -- collectively determine the composition of multimodal information. Understanding these interactions is crucial for analyzing information dynamics in multimodal systems, yet their accurate sample-level quantification presents significant theoretical and computational challenges. To address this, we introduce the Lightweight Sample-wise Multimodal Interaction (LSMI) estimator, rigorously grounded in pointwise information theory. We first develop a redundancy estimation framework, employing an appropriate pointwise information measure to quantify this most decomposable and measurable interaction. Building upon this, we propose a general interaction estimation method that employs efficient entropy estimation, specifically tailored for sample-wise estimation in continuous distributions. Extensive experiments on synthetic and real-world datasets validate LSMI's precision and efficiency. Crucially, our sample-wise approach reveals fine-grained sample- and category-level dynamics within multimodal data, enabling practical applications such as redundancy-informed sample partitioning, targeted knowledge distillation, and interaction-aware model ensembling. The code is available at https://github.com/GeWu-Lab/LSMI_Estimator.
- Europe > Switzerland > Zürich > Zürich (0.14)
- Asia > China > Beijing > Beijing (0.04)
- North America > Canada (0.04)
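
To make "sample-level" quantification concrete, the toy sketch below computes a pointwise information quantity (pointwise mutual information) for a single observed outcome. This is far simpler than the LSMI estimator itself, and the toy distribution is invented.

```python
# Small illustration of a pointwise (sample-level) information quantity:
# pointwise mutual information i(x; y) = log2 [ p(x, y) / (p(x) p(y)) ] for one
# observed outcome. This is far simpler than the LSMI estimator, and the joint
# distribution below is an invented toy example.
import math
from collections import Counter

samples = [("sunny", "yes"), ("sunny", "yes"), ("rainy", "no"),
           ("rainy", "no"), ("sunny", "no"), ("rainy", "yes")]

joint = Counter(samples)
n = len(samples)
px = Counter(x for x, _ in samples)
py = Counter(y for _, y in samples)

def pmi(x, y):
    return math.log2((joint[(x, y)] / n) / ((px[x] / n) * (py[y] / n)))

print(round(pmi("sunny", "yes"), 3))   # positive: this pair co-occurs more than chance
```
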
I2MoE: Interpretable Multimodal Interaction-aware Mixture-of-Experts
Xin, Jiayi, Yun, Sukwon, Peng, Jie, Choi, Inyoung, Ballard, Jenna L., Chen, Tianlong, Long, Qi
Modality fusion is a cornerstone of multimodal learning, enabling information integration from diverse data sources. However, vanilla fusion methods are limited by (1) inability to account for heterogeneous interactions between modalities and (2) lack of interpretability in uncovering the multimodal interactions inherent in the data. To this end, we propose I2MoE (Interpretable Multimodal Interaction-aware Mixture of Experts), an end-to-end MoE framework designed to enhance modality fusion by explicitly modeling diverse multimodal interactions, as well as providing interpretation at both the local and global level. First, I2MoE utilizes different interaction experts with weakly supervised interaction losses to learn multimodal interactions in a data-driven way. Second, I2MoE deploys a reweighting model that assigns importance scores to the output of each interaction expert, which offers sample-level and dataset-level interpretation. Extensive evaluation on medical and general multimodal datasets shows that I2MoE is flexible enough to be combined with different fusion techniques, consistently improves task performance, and provides interpretation across various real-world scenarios. Code is available at https://github.com/Raina-Xin/I2MoE.
- South America > Argentina > Patagonia > Tierra del Fuego Province > Estado Nacional (0.04)
- North America > United States > Pennsylvania (0.04)
- North America > United States > North Carolina > Orange County > Chapel Hill (0.04)
- (6 more...)
- Health & Medicine > Therapeutic Area > Neurology (0.95)
- Health & Medicine > Diagnostic Medicine (0.93)
- Health & Medicine > Health Care Technology (0.93)
- Health & Medicine > Pharmaceuticals & Biotechnology (0.93)
- Information Technology > Data Science (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
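
A hedged sketch of the reweighting idea in the I2MoE abstract: several interaction experts each produce an output, and a gate assigns importance scores that double as a sample-level explanation. Expert names, shapes, and weights are illustrative assumptions.

```python
# Hedged sketch of the reweighting idea described above: interaction experts each
# produce an output, and a gate assigns importance scores that double as a
# sample-level explanation. Names, shapes, and weights are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
d = 6
fused = rng.normal(size=d)                                   # fused multimodal feature
expert_names = ["redundancy", "uniqueness_a", "uniqueness_b", "synergy"]
expert_outputs = {name: rng.normal(size=d) for name in expert_names}

gate_w = rng.normal(size=(len(expert_names), d))
scores = np.exp(gate_w @ fused)
scores /= scores.sum()                                       # importance per expert

combined = sum(s * expert_outputs[n] for s, n in zip(scores, expert_names))
for name, s in zip(expert_names, scores):                    # per-sample interpretation
    print(f"{name:14s} importance {s:.2f}")
print("combined output norm", round(float(np.linalg.norm(combined)), 3))
```
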
BEARCUBS: A benchmark for computer-using web agents
Song, Yixiao, Thai, Katherine, Pham, Chau Minh, Chang, Yapei, Nadaf, Mazin, Iyyer, Mohit
Modern web agents possess computer use abilities that allow them to interact with webpages by sending commands to a virtual keyboard and mouse. While such agents have considerable potential to assist human users with complex tasks, evaluating their capabilities in real-world settings poses a major challenge. To this end, we introduce BEARCUBS, a "small but mighty" benchmark of 111 information-seeking questions designed to evaluate a web agent's ability to search, browse, and identify factual information from the web. Unlike prior web agent benchmarks, solving BEARCUBS requires (1) accessing live web content rather than synthetic or simulated pages, which captures the unpredictability of real-world web interactions; and (2) performing a broad range of multimodal interactions (e.g., video understanding, 3D navigation) that cannot be bypassed via text-based workarounds. Each question in BEARCUBS has a corresponding short, unambiguous answer and a human-validated browsing trajectory, allowing for transparent evaluation of agent performance and strategies. A human study confirms that BEARCUBS questions are solvable but non-trivial (84.7% human accuracy), revealing search inefficiencies and domain knowledge gaps as common failure points. By contrast, state-of-the-art computer-using agents underperform, with the best-scoring system (OpenAI's Operator) reaching only 24.3% accuracy. These results highlight critical areas for improvement, including reliable source selection and more powerful multimodal capabilities. To facilitate future research, BEARCUBS will be updated periodically to replace invalid or contaminated questions, keeping the benchmark fresh for future generations of web agents.
- Asia > Thailand > Bangkok > Bangkok (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- North America > United States > Maryland > Prince George's County > College Park (0.04)
- (3 more...)
- Information Technology > Communications > Web (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.53)
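
Because every BEARCUBS question has a short, unambiguous answer, agent outputs can be scored with exact match after light normalization; a minimal sketch follows. The record format and normalization rule are assumptions, not the benchmark's official scorer.

```python
# Since each BEARCUBS question has a short, unambiguous answer, agent output can
# be scored with exact match after light normalization. The record format and
# normalization rule here are assumptions for illustration.
def normalize(text: str) -> str:
    return " ".join(text.lower().strip().split())

def score(records):
    """records: list of dicts with 'gold' and 'prediction' strings."""
    correct = sum(1 for r in records if normalize(r["gold"]) == normalize(r["prediction"]))
    return correct / len(records)

runs = [
    {"gold": "1947", "prediction": " 1947"},
    {"gold": "Mount Erebus", "prediction": "mount erebus"},
    {"gold": "blue whale", "prediction": "the fin whale"},
]
print(f"accuracy: {score(runs):.2f}")  # 0.67
```
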
InterChat: Enhancing Generative Visual Analytics using Multimodal Interactions
Chen, Juntong, Wu, Jiang, Guo, Jiajing, Mohanty, Vikram, Li, Xueming, Ono, Jorge Piazentin, He, Wenbin, Ren, Liu, Liu, Dongyu
The rise of Large Language Models (LLMs) and generative visual analytics systems has transformed data-driven insights, yet significant challenges persist in accurately interpreting users' analytical and interaction intents. While language inputs offer flexibility, they often lack precision, making the expression of complex intents inefficient, error-prone, and time-intensive. To address these limitations, we investigate the design space of multimodal interactions for generative visual analytics through a literature review and pilot brainstorming sessions. Building on these insights, we introduce a highly extensible workflow that integrates multiple LLM agents for intent inference and visualization generation. We develop InterChat, a generative visual analytics system that combines direct manipulation of visual elements with natural language inputs. This integration enables precise intent communication and supports progressive, visually driven exploratory data analyses. By employing effective prompt engineering and contextual interaction linking, alongside intuitive visualization and interaction designs, InterChat bridges the gap between user interactions and LLM-driven visualizations, enhancing both interpretability and usability. Extensive evaluations, including two usage scenarios, a user study, and expert feedback, demonstrate the effectiveness of InterChat. Results show significant improvements in the accuracy and efficiency of handling complex visual analytics tasks, highlighting the potential of multimodal interactions to redefine user engagement and analytical depth in generative visual analytics.
- Europe > Monaco (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > Yolo County > Davis (0.04)
- (4 more...)
- Workflow (0.88)
- Research Report > New Finding (0.34)
- Information Technology (1.00)
- Materials > Metals & Mining > Steel (0.94)
- Banking & Finance > Trading (0.69)
- Health & Medicine (0.68)
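
A hedged sketch of the multimodal input pattern the InterChat abstract describes: a direct-manipulation selection and a typed request are merged into one structured intent that an LLM agent could consume. All field names are invented.

```python
# Hedged sketch of combining direct manipulation with natural language, as the
# abstract describes: a brushed chart selection and a typed request are merged
# into one structured intent for an LLM agent. All field names are invented.
import json

def build_intent(selection, utterance):
    return {
        "task": "generate_visualization",
        "utterance": utterance,
        "selection": selection,  # resolved from the click/brush, not parsed from text
    }

brush = {"chart": "sales_by_region", "region": ["EMEA", "APAC"], "quarter": "Q3"}
intent = build_intent(brush, "compare these regions against last year")
prompt = "Update the view according to this intent:\n" + json.dumps(intent, indent=2)
print(prompt)
```
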
Generative AI in Multimodal User Interfaces: Trends, Challenges, and Cross-Platform Adaptability
Bieniek, J., Rahouti, M., Verma, D. C.
As the boundaries of human-computer interaction expand, Generative AI emerges as a key driver in reshaping user interfaces, introducing new possibilities for personalized, multimodal, and cross-platform interactions. This integration reflects a growing demand for more adaptive and intuitive user interfaces that can accommodate diverse input types such as text, voice, and video, and deliver seamless experiences across devices. This paper explores the integration of generative AI in modern user interfaces, examining historical developments and focusing on multimodal interaction, cross-platform adaptability, and dynamic personalization. A central theme is the interface dilemma, which addresses the challenge of designing effective interactions for multimodal large language models, assessing the trade-offs between graphical, voice-based, and immersive interfaces. The paper further evaluates lightweight frameworks tailored for mobile platforms, spotlighting the role of mobile hardware in enabling scalable multimodal AI. Technical and ethical challenges, including context retention, privacy concerns, and balancing cloud and on-device processing, are thoroughly examined. Finally, the paper outlines future directions such as emotionally adaptive interfaces, predictive AI-driven user interfaces, and real-time collaborative systems, underscoring generative AI's potential to redefine adaptive user-centric interfaces across platforms.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Sweden > Östergötland County > Linköping (0.04)
- Research Report (1.00)
- Overview (1.00)